Introduction:

Researchers strive to account for as much of the variance in their data as possible, so that a given analysis or machine learning task meets a desired level of accuracy. This is often challenging, especially with data subject to high degrees of uncertainty from sources that are difficult to identify, quantify, and mitigate. In this research project, we have created an orthogonal “coordinate system” to describe the socioeconomic makeup of a large portion of the counties in the United States. The coordinate system serves as a non-spatially dependent set of points that locates a county in terms of its underlying population, rather than by more traditional measures such as latitude and longitude. To do this, we first collect, compile, and engineer data from several sources. Next, we perform a Principal Components Analysis (PCA) to extract our two new feature dimensions. Finally, we display the results of our reduced-dimension data and attempt to predict the results of the 2020 election using only our two coordinates as predictors. This final step serves to validate the coordinate system as one that accurately describes each county.

Armed with this newly formed system, researchers will have a tool to help account for and mitigate variance and uncertainty in their data. This will strengthen the accuracy of predictions and statistical analyses in areas of study that rely heavily on socioeconomic and demographic information.

Data Collection and Processing:

The data set represents demographic data collected and combined from several sources, including Unacast COVID-19 data, census data, religious data, and additional political and demographic data, merged into a single cohesive aggregate set of predictors. In sourcing data, it was important to use data that was as complete as possible, represented each county, was recent enough to be valid, and covered as much of the underlying variance as possible. Additionally, some features required aggregation and transformation, and further features were engineered to provide a comprehensive set from which to perform the PCA. The data dictionary for the final data is given below:

| Variable | Description |
|---|---|
| FIPS | Federal Information Processing Standards code for the county |
| county_population_density | population per square mile |
| daily_distance_diff | minimum measurement of difference in daily visitation |
| encounters_rate | minimum measurement of difference in daily encounters |
| avg_hh_size | average household size (in people) |
| living_alone | percentage of residents living alone |
| bach_degree | percentage of residents who have completed a bachelor’s degree |
| grad_degree | percentage of residents who have completed a graduate degree |
| internet_access | percentage of residents who have broadband internet access |
| mhhi | median household income (in USD) |
| mean_min_to_work | mean commute to work (in minutes) |
| gini | county Gini index |
| unemployment | percentage of residents unemployed |
| median_age | median age of residents (in years) |
| perc_dem_2016 | percentage of vote to Democrats in 2016 |
| perc_gop_2016 | percentage of vote to Republicans in 2016 |
| perc_obesity | percentage of residents who are obese |
| establishments | number of business establishments |
| employees | number of total employees |
| payroll | total payroll (in USD) |
| TOTCNG | total number of religious congregations |
| TOTRATE | number of religious adherents per 1,000 residents |
| White | percentage of the population that self-identify as White |
| Black | percentage of the population that self-identify as Black |
| Hispanic | percentage of the population that self-identify as Hispanic |
| Asian | percentage of the population that self-identify as Asian |
| Amerindian | percentage of the population that self-identify as American Indian |
| Other | percentage of the population that self-identify as other |
| Children.Under.6.Living.in.Poverty | percentage of children under 6 living in poverty |
| Child.Poverty.living.in.families.below.the.poverty.line | percentage of children living in families below the poverty line |
| Children.in.single.parent.households | percentage of children living in single-parent households |
| Adults.65.and.Older.Living.in.Poverty | percentage of adults 65 and older living in poverty |
| Preschool.Enrollment.Ratio.enrolled.ages.3.and.4 | percentage of 3- and 4-year-old children enrolled in preschool |
| Uninsured | proportion of the population without health insurance |
| party | party with the higher proportion of the vote in 2016 |

Additionally, we will take a look at the kind of data we are dealing with in terms of data types and example observations for each predictor. From this, we can see that 3,034 of the 3,142 counties are represented. The excluded counties had important features with scattered missing values (NA) and no reliable means of imputation. From our original data sources, we compiled and engineered 35 variables which we believe accurately describe the socioeconomic and demographic makeup of each county in a way that enables us to create our feature dimensions via PCA. The selected features are all numeric, with the exception of party, which will be used purely to assist in visualizing the dimensions when we analyze our principal components. It is important to note that PCA requires standardization of the data, which will be performed when the data is preprocessed just before it is fed into the model; however, it must be considered nonetheless. We will use z-standardization, \(Z = \frac{y_i-\mu}{\sigma}\), which transforms each predictor to a scale with mean \(\mu = 0\) and standard deviation \(\sigma = 1\). Each observation is then expressed as the number of standard deviations it lies from the mean. This is necessary to ensure that the magnitude of an individual variable’s units does not make that variable more “important” than any other; if all of our predictors were in the same unit of measure, this would not be necessary. From our data dictionary, and the table below, we can see that this process will be crucial to our model.
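As a concrete illustration, base R’s `scale()` performs exactly this centering and scaling (a sketch; `prcomp()` with `center = TRUE, scale. = TRUE`, used later, does the same internally):

```r
# z-standardize every numeric predictor: subtract the column mean
# and divide by the column standard deviation
numeric_features <- data_county %>% select(-c(FIPS, party))
scaled_features  <- scale(numeric_features, center = TRUE, scale = TRUE)

# after scaling, each column has mean 0 and standard deviation 1
round(colMeans(scaled_features), 10)
apply(scaled_features, 2, sd)
```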

## Rows: 3,034
## Columns: 35
## $ FIPS                                                    <dbl> 1001, 1003, 10…
## $ county_population_density                               <dbl> 93.60438, 137.…
## $ daily_distance_diff                                     <dbl> -0.5587500, -0…
## $ encounters_rate                                         <dbl> -0.8235534, -0…
## $ avg_hh_size                                             <dbl> 2.59, 2.61, 2.…
## $ living_alone                                            <dbl> 25.1, 30.4, 32…
## $ bach_degree                                             <dbl> 15.9, 20.7, 7.…
## $ grad_degree                                             <dbl> 11.8, 10.6, 4.…
## $ internet_access                                         <dbl> 78.9, 78.1, 60…
## $ mhhi                                                    <int> 58786, 55962, …
## $ mean_min_to_work                                        <dbl> 25.8, 27.4, 22…
## $ gini                                                    <dbl> 0.4602, 0.4609…
## $ unemployment                                            <dbl> 2.5, 2.6, 4.4,…
## $ median_age                                              <dbl> 37.8, 42.8, 39…
## $ perc_dem_2016                                           <dbl> 24.0, 19.6, 46…
## $ perc_gop_2016                                           <dbl> 73.4, 77.4, 52…
## $ perc_obesity                                            <dbl> 30.9, 26.7, 40…
## $ establishments                                          <int> 851, 5235, 452…
## $ employees                                               <int> 10790, 61341, …
## $ payroll                                                 <int> 332497, 196015…
## $ TOTCNG                                                  <dbl> 106, 271, 89, …
## $ TOTRATE                                                 <dbl> 676.8789, 531.…
## $ White                                                   <dbl> 77.55, 84.30, …
## $ Black                                                   <dbl> 17.70, 9.55, 4…
## $ Hispanic                                                <dbl> 2.20, 3.65, 4.…
## $ Asian                                                   <dbl> 0.75, 0.65, 0.…
## $ Amerindian                                              <dbl> 0.40, 0.60, 0.…
## $ Other                                                   <dbl> 1.35, 1.25, 0.…
## $ Children.Under.6.Living.in.Poverty                      <dbl> 21.15, 20.90, …
## $ Child.Poverty.living.in.families.below.the.poverty.line <dbl> 14.00, 19.10, …
## $ Children.in.single.parent.households                    <dbl> 0.301, 0.279, …
## $ Adults.65.and.Older.Living.in.Poverty                   <dbl> 8.75, 5.20, 21…
## $ Preschool.Enrollment.Ratio.enrolled.ages.3.and.4        <dbl> 42.3, 43.0, 32…
## $ Uninsured                                               <dbl> 0.139, 0.166, …
## $ party                                                   <chr> "gop", "gop", …

The correlation matrix below shows which variables are related to one another, and uncovers certain trends in our data. It will be a useful tool when we analyze the feature dimensions extracted from the PCA. Looking at the plot, we can see some relationships that intuition would suggest. For instance, the associations between uninsured and the other covariates follow an expected pattern: education, absence of poverty, and in some cases race all share a negative association with lacking insurance. county_population_density also shows reasonable associations. Because the PCA technique creates linearly independent components, we do not need to be concerned about multicollinearity in our data. The plot does, however, serve as a good check on the reasonableness of our data set: it gives us a sense of whether the data is likely accurate, which is especially valuable given that the set was built from several distinct data sets from different sources.
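A plot of this kind can be produced with, for example, the corrplot package (a sketch; the original figure may have been generated differently):

```r
library(corrplot)  # assumed; any correlation-plot package would do

# pairwise Pearson correlations over the numeric predictors
cor_matrix <- cor(data_county %>% select(-c(FIPS, party)))
corrplot(cor_matrix, method = "color", type = "upper",
         tl.cex = 0.5, tl.col = "black")
```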

Principal Components Analysis:

In this section, we will perform the Principal Components Analysis on the data. As discussed previously, this technique transforms the \(p\) original predictors into \(p\) orthogonal principal components (here, 33). Because each resulting PC is linearly independent of the others, PCs are a great tool for modeling techniques that require independence of predictors. From these PCs we have chosen to retain only the first two, which will represent our transformed coordinate system, though generally this choice of the number of components would be made after the PCA is fit. The summary below shows the proportion of variance in the data that each PC “accounts” for. In our case, PC1 accounts for about 22.72% of the variance, and PC2 for about 18.93%. Generally speaking, the number of PCs retained would be based on the cumulative proportion of variance explained: one chooses a variance threshold, say 90%, and keeps as many PCs as are needed to reach it cumulatively. Our choice of two PCs leaves the cumulative explained variance low, at only about 41.65%. However, we have justified this choice based on our needs, and feel that it is sufficient.

pca_features <- data_county %>% select(-c(FIPS,party))
pca <- prcomp(pca_features, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.7383 2.4990 1.65672 1.55779 1.3436 1.19476 1.15651
## Proportion of Variance 0.2272 0.1893 0.08317 0.07354 0.0547 0.04326 0.04053
## Cumulative Proportion  0.2272 0.4165 0.49965 0.57318 0.6279 0.67115 0.71168
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.10702 0.96699 0.94637 0.91057 0.87156 0.80403 0.76147
## Proportion of Variance 0.03714 0.02834 0.02714 0.02513 0.02302 0.01959 0.01757
## Cumulative Proportion  0.74881 0.77715 0.80429 0.82941 0.85243 0.87202 0.88959
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.73449 0.68437 0.61083 0.57183 0.55899 0.52670 0.48666
## Proportion of Variance 0.01635 0.01419 0.01131 0.00991 0.00947 0.00841 0.00718
## Cumulative Proportion  0.90594 0.92013 0.93144 0.94135 0.95082 0.95922 0.96640
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.46815 0.44056 0.42874 0.37304 0.36191 0.28981 0.28204
## Proportion of Variance 0.00664 0.00588 0.00557 0.00422 0.00397 0.00255 0.00241
## Cumulative Proportion  0.97304 0.97892 0.98449 0.98871 0.99268 0.99522 0.99763
##                           PC29    PC30    PC31    PC32     PC33
## Standard deviation     0.19610 0.15881 0.09386 0.07468 0.001732
## Proportion of Variance 0.00117 0.00076 0.00027 0.00017 0.000000
## Cumulative Proportion  0.99880 0.99956 0.99983 1.00000 1.000000
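The variance proportions in the summary can be recovered directly from the component standard deviations; for instance, to see how many components a 90% threshold would actually require:

```r
# proportion of variance explained = squared component sdev / total variance
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

cum_var[2]                  # cumulative variance of our two retained PCs (~0.4165)
which(cum_var >= 0.90)[1]   # first component count reaching 90% (15, per the summary)
```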

The scree plot below is also commonly used to choose the number of principal components to retain. In addition to the cumulative variance, we generally look for the “elbow” in the plot, the point where the marginal gain in explained variance from one component to the next drops off most sharply. Here, the first elbow occurs at PC2, with additional smaller elbows at PC6 and PC9. This is encouraging: the two PCs we retain are much more important than all of the others, and are of similar importance to each other.

fviz_eig(pca, main='Scree plot: COVID Feature Selection', addlabels=TRUE)

The plot below helps us visualize the contribution of individual predictors from the data set to each of our retained components. From this, we can see that race, age, and political affiliations have a strong effect on the first dimension, while education and other economic factors have a strong effect on the second dimension.

fviz_pca_var(pca)

The table below shows the importance of each predictor in relation to each of the dimensions. Here, the magnitude of the predictor’s loading indicates its importance to the dimension. This table verifies the conclusions drawn from the plot above.

pca$rotation[,1:2]
##                                                                 PC1
## county_population_density                                0.14626900
## daily_distance_diff                                     -0.11395330
## encounters_rate                                          0.13863949
## avg_hh_size                                              0.01705352
## living_alone                                            -0.07376973
## bach_degree                                              0.29947755
## grad_degree                                              0.27651886
## internet_access                                          0.28945891
## mhhi                                                     0.30097640
## mean_min_to_work                                         0.02778861
## gini                                                    -0.06185088
## unemployment                                            -0.07638630
## median_age                                              -0.05681742
## perc_dem_2016                                            0.13452180
## perc_gop_2016                                           -0.15576303
## perc_obesity                                            -0.23887569
## establishments                                           0.22778411
## employees                                                0.22772149
## payroll                                                  0.22191656
## TOTCNG                                                   0.20404150
## TOTRATE                                                 -0.02987051
## White                                                    0.03415251
## Black                                                   -0.09054936
## Hispanic                                                 0.02205491
## Asian                                                    0.23360103
## Amerindian                                              -0.04858233
## Other                                                    0.08123726
## Children.Under.6.Living.in.Poverty                      -0.23493000
## Child.Poverty.living.in.families.below.the.poverty.line -0.24392850
## Children.in.single.parent.households                    -0.14764205
## Adults.65.and.Older.Living.in.Poverty                   -0.21638853
## Preschool.Enrollment.Ratio.enrolled.ages.3.and.4         0.08291977
## Uninsured                                               -0.16725984
##                                                                  PC2
## county_population_density                                0.160135806
## daily_distance_diff                                     -0.005550522
## encounters_rate                                          0.155779941
## avg_hh_size                                              0.132723242
## living_alone                                             0.069852316
## bach_degree                                             -0.021929500
## grad_degree                                              0.051063853
## internet_access                                         -0.118687996
## mhhi                                                    -0.102253121
## mean_min_to_work                                         0.065794823
## gini                                                     0.240179563
## unemployment                                             0.231406610
## median_age                                              -0.146496523
## perc_dem_2016                                            0.269608366
## perc_gop_2016                                           -0.246211641
## perc_obesity                                             0.046421939
## establishments                                           0.199306440
## employees                                                0.206330225
## payroll                                                  0.201850662
## TOTCNG                                                   0.206626230
## TOTRATE                                                 -0.013661634
## White                                                   -0.328784043
## Black                                                    0.259542950
## Hispanic                                                 0.138147773
## Asian                                                    0.132013364
## Amerindian                                               0.067358817
## Other                                                    0.048865074
## Children.Under.6.Living.in.Poverty                       0.211034022
## Child.Poverty.living.in.families.below.the.poverty.line  0.239800946
## Children.in.single.parent.households                     0.283520479
## Adults.65.and.Older.Living.in.Poverty                    0.178520696
## Preschool.Enrollment.Ratio.enrolled.ages.3.and.4         0.084328725
## Uninsured                                                0.145670252

The plots below show each county plotted in our transformed coordinate system, colored by party affiliation in the 2016 election. There appears to be a distinct clustering effect based on political affiliation, with some overlap at what we would consider a decision boundary. In our final verification step, we will try to predict party affiliation in the 2020 election cycle based only on these two dimensions. The second plot shows the predictors that most affected the position of the points, presented as vectors whose angle indicates the direction of the effect and whose length indicates its strength.
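A scatter plot along these lines can be produced with ggplot2 (a sketch; it assumes the PCA scores have already been joined to `data_county` as `Dim.1` and `Dim.2`):

```r
library(ggplot2)

# counties in the transformed coordinate system, colored by 2016 party
ggplot(data_county, aes(x = Dim.1, y = Dim.2, color = party)) +
  geom_point(alpha = 0.5, size = 1) +
  scale_color_manual(values = c(dem = "blue", gop = "red")) +
  labs(title = "Counties in the transformed coordinate system",
       x = "Dimension 1", y = "Dimension 2")
```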

Here we have created a bivariate choropleth map. First, counties are classified into one of nine classes and assigned a color based on their class; this is displayed in the first visualization. Next, we overlay each county onto a map of the United States to see whether any trends appear. Although the groups are not well separated, there is a distinct pattern in how counties group together. For instance, population centers look very different from rural areas. Coastal areas such as New York, LA, SF, Seattle, and Miami are grouped together; rural areas in the Midwest share similar colors, as do areas in the Southeast, with population centers again standing apart from the rural areas. This suggests that the PCA is picking up not only sheer population and economic factors, but a hint of ideology as well.
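The nine classes can be built by splitting each dimension into terciles and crossing them (a sketch using dplyr’s `ntile()`; the original map may have used a dedicated bivariate-mapping package instead):

```r
library(dplyr)

# cross the terciles of each dimension to get a 3 x 3 = 9-class scheme
data_county <- data_county %>%
  mutate(dim1_bin   = ntile(Dim.1, 3),   # low / mid / high on dimension 1
         dim2_bin   = ntile(Dim.2, 3),   # low / mid / high on dimension 2
         biv_class  = paste(dim1_bin, dim2_bin, sep = "-"))
```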

Results and Discussion:

We will now add the election data from the 2020 election cycle in order to create a classifier, and make predictions. The strength of the classifier will help us understand if the components chosen from our PCA actually do represent some meaningful coordinate system.

# 2020 election results by county
election_2020 <- read.csv("/mnt/18246CDB246CBCFE/ProjectData/unacast/2020election.csv")

# attach the PCA coordinates (Dim.1, Dim.2) computed earlier
data_county <- merge(data_county, post_pca, by = "FIPS")

# derive the 2020 winning party for each county
election_results <- election_2020 %>%
  mutate(party_2020 = case_when(per_gop - per_dem > 0 ~ 'gop',
                                per_gop - per_dem < 0 ~ 'dem')) %>%
  select(county_fips, party_2020) %>%
  rename(FIPS = county_fips)

data_county <- inner_join(data_county, election_results, by = "FIPS")

Support Vector Classifier:

For this step, we have selected a Support Vector Classifier (SVC). This classifier was chosen because of the clear clustering effect seen in the visualizations of the counties in their new coordinate system, and what we can interpret as a very clear decision boundary in the 2016 election results. The code above shows how the target party_2020 was created and added to the feature data; the code below converts it to a factor and partitions the data into a 70/30 training/testing split, so that we can validate our results and view predictions on unseen data. SVCs are a machine learning technique in which a soft hyperplane, often referred to as a “decision boundary,” is fit between the classes. Points falling on either side of the hyperplane are assigned to the respective class, with points within a margin of the boundary permitted to violate it at a penalty.
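Formally, with our two coordinates as the feature vector \(x = (\text{Dim.1}, \text{Dim.2})\), a linear SVC predicts

\[ f(x) = \operatorname{sign}(w^\top x + b), \]

where the weights \(w\) and intercept \(b\) are chosen to maximize the margin \(2/\lVert w \rVert\) while allowing margin violations \(\xi_i \ge 0\) at a cost set by the tuning parameter \(C\):

\[ \min_{w,\,b,\,\xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i. \]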

data_county$party_2020 <- as.factor(data_county$party_2020)
features <- data_county %>% select(c(Dim.1, Dim.2,party_2020,FIPS))

training_index <- createDataPartition(features$party_2020,
                                      times = 1,
                                      p = 0.7,
                                      list = FALSE)

features_train <- features[training_index,]
features_test <- features[-training_index,]

We have trained a linear Support Vector Classifier, meaning that the hyperplane is linear in nature. This was chosen due to the linearity we saw in our clusters when analyzing the dimensions from the PCA. The results of the trained model show an accuracy of about 91.93%, with a Kappa (agreement beyond what would be expected by random chance) of about 0.68.

svm_linear <- train(party_2020 ~ Dim.1 + Dim.2,
                data = features_train, 
                method = "svmLinear")

svm_linear$results
##   C  Accuracy     Kappa  AccuracySD    KappaSD
## 1 1 0.9192877 0.6840868 0.008263236 0.03604089

When tested on the unseen data, we see an accuracy of about 93.23%. The 95% confidence interval suggests our model would classify between 91.39% and 94.78% of similar data correctly. The p-value (< 2.2e-16) indicates that the model’s accuracy is significantly better than the no-information rate. We have chosen “gop” as the “positive” class because of the prevalence of observations in that class in our data. The high sensitivity (true positive rate) tells us that our classifier handles observations in this class well, while the lower specificity (true negative rate) shows that the model struggles slightly with observations in the “dem” class.

svm_predictions <- predict(svm_linear, features_test)

confusionMatrix(data = svm_predictions,
                reference = features_test$party_2020,
                positive = "gop")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction dem gop
##        dem 106  16
##        gop  45 734
##                                           
##                Accuracy : 0.9323          
##                  95% CI : (0.9139, 0.9478)
##     No Information Rate : 0.8324          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7372          
##                                           
##  Mcnemar's Test P-Value : 0.000337        
##                                           
##             Sensitivity : 0.9787          
##             Specificity : 0.7020          
##          Pos Pred Value : 0.9422          
##          Neg Pred Value : 0.8689          
##              Prevalence : 0.8324          
##          Detection Rate : 0.8147          
##    Detection Prevalence : 0.8646          
##       Balanced Accuracy : 0.8403          
##                                           
##        'Positive' Class : gop             
## 
features_test <- features_test %>% mutate(svm_pred = svm_predictions)
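As a sanity check, the reported sensitivity and specificity follow directly from the cell counts in the confusion matrix, with “gop” as the positive class:

\[ \text{Sensitivity} = \frac{TP}{TP + FN} = \frac{734}{734 + 16} \approx 0.9787, \qquad \text{Specificity} = \frac{TN}{TN + FP} = \frac{106}{106 + 45} \approx 0.7020. \]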

From the table below, we can see that the classes are heavily imbalanced, so this result is to be expected.

features %>% group_by(party_2020) %>% summarise(n = n())
## # A tibble: 2 x 2
##   party_2020     n
##   <fct>      <int>
## 1 dem          505
## 2 gop         2502

Naive Bayes Classifier:

We have also chosen to perform the same predictions using a Naive Bayes Classifier. Naive Bayes Classifiers assign class membership via a probabilistic model that uses Bayes’ Theorem to evaluate the likelihood that each point belongs to a given class, given the data. The core assumption of independence between the predictors makes a Naive Bayes Classifier a good choice for a wide range of classification tasks; here, however, with a training accuracy of 90.25% and a testing accuracy of 90.68%, it is outperformed by the SVC. It is also a slightly less significant model, though the p-value (9.951e-11) still falls well below the 0.05 significance level.
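Concretely, with our two dimensions the classifier applies Bayes’ Theorem under the “naive” assumption that the predictors are conditionally independent given the class:

\[ P(k \mid \text{Dim.1}, \text{Dim.2}) \propto P(k)\, P(\text{Dim.1} \mid k)\, P(\text{Dim.2} \mid k), \quad k \in \{\text{dem}, \text{gop}\}, \]

where each class-conditional density is modeled as Gaussian, with class-specific means and standard deviations estimated from the training data.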

nb_model <- train(party_2020 ~ Dim.1 + Dim.2,
                  data = features_train,
                  method = "naive_bayes")

nb_model$results
##   usekernel laplace adjust  Accuracy     Kappa  AccuracySD    KappaSD
## 1     FALSE       0      1 0.9025155 0.6160957 0.007795281 0.03049385
## 2      TRUE       0      1 0.9006321 0.5948432 0.009192761 0.03856406
nb_model$finalModel
## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.default(x = x, y = y, laplace = param$laplace, usekernel = FALSE)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##       dem       gop 
## 0.1680912 0.8319088 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: Dim.1 (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## Dim.1         dem        gop
##   mean  2.3543217 -0.4254582
##   sd    4.3645226  1.8467886
## 
## --------------------------------------------------------------------------------- 
##  ::: Dim.2 (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## Dim.2         dem        gop
##   mean  2.6806171 -0.6344432
##   sd    3.4376027  1.6513122
## 
## ---------------------------------------------------------------------------------
nb_predictions <- predict(nb_model, features_test)

confusionMatrix(data = nb_predictions,
                reference = features_test$party_2020,
                positive = "gop")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction dem gop
##        dem  99  32
##        gop  52 718
##                                          
##                Accuracy : 0.9068         
##                  95% CI : (0.8859, 0.925)
##     No Information Rate : 0.8324         
##     P-Value [Acc > NIR] : 9.951e-11      
##                                          
##                   Kappa : 0.6472         
##                                          
##  Mcnemar's Test P-Value : 0.03817        
##                                          
##             Sensitivity : 0.9573         
##             Specificity : 0.6556         
##          Pos Pred Value : 0.9325         
##          Neg Pred Value : 0.7557         
##              Prevalence : 0.8324         
##          Detection Rate : 0.7969         
##    Detection Prevalence : 0.8546         
##       Balanced Accuracy : 0.8065         
##                                          
##        'Positive' Class : gop            
## 
features_test <- features_test %>% mutate(nb_pred = nb_predictions)
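The fitted parameters printed by nb_model$finalModel above are enough to reproduce a prediction by hand. A sketch for a hypothetical county at (Dim.1, Dim.2) = (3, 3), a point near the “dem” cluster mean:

```r
# class priors and Gaussian parameters as printed by nb_model$finalModel
prior <- c(dem = 0.1680912, gop = 0.8319088)

# unnormalized posteriors for a hypothetical point (3, 3)
post_dem <- prior["dem"] * dnorm(3, mean =  2.3543217, sd = 4.3645226) *
                           dnorm(3, mean =  2.6806171, sd = 3.4376027)
post_gop <- prior["gop"] * dnorm(3, mean = -0.4254582, sd = 1.8467886) *
                           dnorm(3, mean = -0.6344432, sd = 1.6513122)

# normalize to class probabilities and pick the larger
posts <- c(post_dem, post_gop) / (post_dem + post_gop)
names(which.max(posts))   # for this point the posterior favors "dem"
```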

Linear Discriminant Analysis:

Finally, we fit a Linear Discriminant Classifier. This technique finds linear combinations of the predictors that best separate the classes, yielding a linear decision boundary to which each observation is assigned. This model had the poorest performance of the three, with a training accuracy of 90.67% and a testing accuracy of 90.68%. The most troubling thing about this model is that it predicts the Democrat class at little better than random chance (a specificity of about 52%), which can be seen in the confusion matrix below.
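Under the hood, LDA assumes each class is Gaussian with a shared covariance matrix \(\Sigma\), and assigns an observation \(x\) to the class \(k\) with the largest discriminant score

\[ \delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k, \]

where \(\mu_k\) is the class mean and \(\pi_k\) the prior probability of class \(k\). With two classes, this reduces to a single linear boundary in the (Dim.1, Dim.2) plane.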

lda_model <- train(form = party_2020 ~ Dim.1 + Dim.2,
                   data = features_train,
                   method = "lda2")

lda_model
## Linear Discriminant Analysis 
## 
## 2106 samples
##    2 predictor
##    2 classes: 'dem', 'gop' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2106, 2106, 2106, 2106, 2106, 2106, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9066623  0.5878589
## 
## Tuning parameter 'dimen' was held constant at a value of 1
lda_model$results
##   dimen  Accuracy     Kappa AccuracySD    KappaSD
## 1     1 0.9066623 0.5878589 0.01119217 0.04649203
summary(lda_model)
##             Length Class      Mode     
## prior       2      -none-     numeric  
## counts      2      -none-     numeric  
## means       4      -none-     numeric  
## scaling     2      -none-     numeric  
## lev         2      -none-     character
## svd         1      -none-     numeric  
## N           1      -none-     numeric  
## call        3      -none-     call     
## xNames      2      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   2      -none-     character
## param       0      -none-     list
lda_model$finalModel
## Call:
## lda(x, grouping = y)
## 
## Prior probabilities of groups:
##       dem       gop 
## 0.1680912 0.8319088 
## 
## Group means:
##          Dim.1      Dim.2
## dem  2.3543217  2.6806171
## gop -0.4254582 -0.6344432
## 
## Coefficients of linear discriminants:
##              LD1
## Dim.1 -0.3071837
## Dim.2 -0.4428649
#builds the confusion matrix for the LDA model
lda_predictions <- predict(lda_model, features_test)

confusionMatrix(data = lda_predictions,
                reference = features_test$party_2020,
                positive = "gop")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction dem gop
##        dem  79  12
##        gop  72 738
##                                          
##                Accuracy : 0.9068         
##                  95% CI : (0.8859, 0.925)
##     No Information Rate : 0.8324         
##     P-Value [Acc > NIR] : 9.951e-11      
##                                          
##                   Kappa : 0.6028         
##                                          
##  Mcnemar's Test P-Value : 1.215e-10      
##                                          
##             Sensitivity : 0.9840         
##             Specificity : 0.5232         
##          Pos Pred Value : 0.9111         
##          Neg Pred Value : 0.8681         
##              Prevalence : 0.8324         
##          Detection Rate : 0.8191         
##    Detection Prevalence : 0.8990         
##       Balanced Accuracy : 0.7536         
##                                          
##        'Positive' Class : gop            
## 
features_test <- features_test %>% mutate(lda_pred = lda_predictions)
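As a sanity check on the claim about the Democrat class, its recall can be recomputed directly from the confusion matrix above (note that because "gop" is the positive class, the dem-class recall is reported as Specificity). This is a small illustrative sketch using the printed counts, not part of the original analysis:

```r
# Rebuild the confusion matrix from the printed counts and recover
# the dem-class recall (reported as Specificity above).
cm <- matrix(c(79, 72, 12, 738), nrow = 2,
             dimnames = list(Prediction = c("dem", "gop"),
                             Reference  = c("dem", "gop")))

dem_recall <- cm["dem", "dem"] / sum(cm[, "dem"])  # 79 / 151, approx. 0.5232
```

A recall of roughly 0.52 on the minority class, against a 16.8% prior, is better than chance but far below the model's performance on the "gop" class.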

Conclusion:

In this project, we set out to create a non-spatially dependent coordinate system on which we could locate each county in the United States based on social, ideological, economic, and demographic features of its population. Data was collected from several sources and engineered by selecting relevant features, transforming them, and finally centering and scaling the data to create a unit-less representation of each source feature used in modeling. We then performed a Principal Components Analysis on the source data to create a transformed set of orthogonal features, from which we selected the two strongest in terms of explained variance. These two components became the dimensions on which we located our counties. Analysis of the chosen features showed that these dimensions serve well in clustering counties by population differences, without respect to any spatial dependence.

In analyzing our transformed features, we noted that certain key demographic traits of each county were the strongest indicators of where a county fell in the new coordinate system. Education and economic indicators tended to drive the components in one dimension, while ethnicity, political affiliation, and ideological factors tended to drive them in the other. When the resulting county class assignments were superimposed on a map of the United States, our dimensions were able to tease out subtle differences among the populations of the counties included in the analysis. Rural areas differed from population centers across the country, and there were noticeable regional trends as well. These regional trends were more pronounced for rural areas, as population centers tended to resemble one another regardless of the physical region in which they are located.

Finally, we sought to quantify the strength of the new coordinate system by building several classifiers that attempt to predict the results of the 2020 Presidential election using only our two created feature dimensions. Based on the clusters that emerged from the PCA dimensions, we chose a Support Vector Machine as the primary classifier, with Naive Bayes and Linear Discriminant Classifiers added for comparison. All three classifiers performed well, with the SVM making the strongest predictions at about 93.23% accuracy on unseen data. We also saw that our models suffered slightly from class imbalance in the dichotomous “GOP” and “DEM” classes, with the “GOP” class accounting for about 83% of all observations. Overall, based on the classifier results, we can say that the reduced-dimension coordinate system formed from the first two principal components is an accurate representation of the given factors that distinguish counties from one another.
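The class imbalance noted above could also be mitigated at training time. One possible approach, not used in the original analysis, is caret's built-in resampling-time down-sampling of the majority class via the sampling argument of trainControl. A sketch, assuming the same features_train data frame used throughout:

```r
# Sketch: down-sample the majority ("gop") class within each bootstrap
# resample before fitting, to reduce the effect of class imbalance.
# Assumes the features_train data frame from the analysis above.
library(caret)

ctrl <- trainControl(method = "boot", number = 25,
                     sampling = "down")  # balance classes before each fit

lda_down <- train(form = party_2020 ~ Dim.1 + Dim.2,
                  data = features_train,
                  method = "lda2",
                  trControl = ctrl)
```

Down-sampling typically trades some overall accuracy for better sensitivity on the minority ("dem") class; caret also supports "up" and "smote" as alternative sampling strategies.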

Future Work and Limitations:

Moving forward, while this data could be used directly by researchers modeling differences among counties, it may be necessary, depending on the application, to recreate the coordinate system with predictors more relevant to the analysis being performed. The data set created in the pre-processing steps captures only a few of the factors that differentiate populations across areas, so future projects may seek to combine additional or different sources of data to describe the underlying latent differences. The technique could also be applied to entirely different sets of predictors in order to create a county-level coordinate system describing other domains. For instance, if one were interested in making class predictions in the agricultural market for the purposes of sales forecasting, domain-specific predictors could be used in place of the demographic features seen here. The technique could likewise be applied to explain differences in non population-based domains, such as physical features or economic makeup.

One of the major limitations of this technique is the way in which the data was collected. Locating data that describes such a vast number of factors required combining several different sources. If this technique were to be improved upon, the source data would be one of the first areas to address.